Introduction

This tutorial has two major aims: The first one is to show the general workflow of how land cover classifications (or similar tasks) based on satellite data can be performed in R using machine learning algorithms. The second important aim is to show how to assess the area to which a spatial prediction model can be applied (“Area of applicability”, AOA). This is relevant because in spatial predictive mapping, models are often applied to make predictions far beyond sampling locations (i.e. field observarions used to map a variable even on a global scale), where new locations might considerably differ in their environmental properties. However, areas in the predictor space without support of training data are problematic. The model has no knowledge about these environments and predictions for such areas have to be considered highly uncertain.

Prediction task

The example prediction task is to perfom a supervised land cover classification for the Münster in Germany. The dataset to do this includes selected spectral channels of a Sentinel-2 scene. As resposne (reference/ground truth) we use digitized polygons that were created on the basis of expert knowledge.

How to start

For this tutorial we need the terra package for processing of the satellite data as well as the caret package as a wrapper for machine learning (here: randomForest) algorithms. Sf is used for handling of the training data available as vector data (polygons). CAST will be used to account for spatial dependencies during model validation as well as for the estimation of the AOA.

rm(list=ls())
#major required packages:
library(terra)
library(sf)
library(caret)
library(mapview)
library(CAST)
library(tmap)

Data preparation

Load and explore the data

To start with, let’s load and explore the remote sensing raster data as well as the vector data that include the training sites.

Raster data (predictor variables)

sen_ms <- rast("data/sen_muenster.tif")
print(sen_ms)
## class       : SpatRaster 
## dimensions  : 664, 798, 7  (nrow, ncol, nlyr)
## resolution  : 10, 10  (x, y)
## extent      : 400620, 408600, 5753560, 5760200  (xmin, xmax, ymin, ymax)
## coord. ref. : +proj=utm +zone=32 +ellps=WGS84 +towgs84=0,0,0,0,0,0,0 +units=m +no_defs 
## source      : sen_muenster.tif 
## names       :     B02,     B03,     B04,   B08,        B06,        B07, ... 
## min values  :    37.5,   211.5,     1.0,     1,   313.1562,   279.8438, ... 
## max values  : 14582.5, 15787.5, 18767.5, 19929, 10972.9688, 11645.3438, ...

The multi-band raster contains a subset of the optical data from Sentinel-2 (see band information here: https://en.wikipedia.org/wiki/Sentinel-2) given in scaled reflectances (B02-B11). Let’s plot the multi-band raster to get an idea how the variables look like.

plot(sen_ms)

### true color and false color composites
plotRGB(sen_ms,r=3,g=2,b=1,stretch="lin")

plotRGB(sen_ms,r=4,g=3,b=2,stretch="lin")

Calculate the NDVI

As an additional predictor variable, we will calculate the NDVI and directly add it as a new raster band

sen_ms$NDVI <- (sen_ms$B08-sen_ms$B04)/(sen_ms$B08+sen_ms$B04)
plot(sen_ms$NDVI)

Add rasters of coordinates (not mandatory…used to show some things later)

xy <- crds(sen_ms,na.rm=F)
sen_ms$lon <- rast(sen_ms[[1]],vals=xy[,1],names="lon")
sen_ms$lat <- rast(sen_ms[[1]],vals=xy[,2],names="lat")

Vector data (Response variable)

The vector file is read as sf object. It contains the training sites of 7 Land cover classes. These are polygons (33 in total) that were digitized in QGIS on the basis of the Sentinel data and with support of an aerial image and using expert knowledge. They can be regarded here as a ground truth for the land cover classification.

trainSites <- read_sf("data/trainingsites_muenster.gpkg")
print(trainSites)
## Simple feature collection with 33 features and 3 fields
## Geometry type: POLYGON
## Dimension:     XY
## Bounding box:  xmin: 400682 ymin: 5753925 xmax: 408327.9 ymax: 5760113
## Projected CRS: WGS 84 / UTM zone 32N
## # A tibble: 33 × 4
##    ClassID Label         PolygonID                                          geom
##      <dbl> <chr>             <int>                                 <POLYGON [m]>
##  1      11 Planted field         1 ((402789.2 5759326, 402850.9 5759266, 402901…
##  2      11 Planted field         2 ((403154.2 5759459, 403357.6 5759470, 403373…
##  3      11 Planted field         3 ((403124.5 5759788, 403314.6 5759806, 403397…
##  4      14 Inland water          4 ((404809.3 5757129, 404865.6 5757163, 404865…
##  5      14 Inland water          5 ((403729 5756102, 403797.5 5756077, 403770.6…
##  6      14 Inland water          6 ((408139 5759018, 408201.5 5759011, 408189.7…
##  7      14 Inland water          7 ((406992.5 5756470, 407110.9 5756441, 407184…
##  8      14 Inland water          8 ((406334.9 5755042, 406351.1 5755068, 406382…
##  9       4 Grassland             9 ((403302.3 5755592, 403332.2 5755597, 403337…
## 10       4 Grassland            10 ((405602.6 5760078, 405610.9 5760104, 405628…
## # … with 23 more rows

Using mapview’s viewRGB function we can visualize the aerial image channels as true color composite in the geographical context and overlay it with the polygons. Click on the polygons to see which land cover class is assigned to a respective polygon.

viewRGB(as(sen_ms,"Raster"), r = 3, g = 2, b = 1, map.types = "Esri.WorldImagery")+
  mapview(trainSites)

Extract raster information

In order to train a machine learning model between the spectral properties and the land cover class, we first need to create a data frame that contains the predictor variables at the location of the training sites as well as the corresponding class information. This data frame can be produced with the extract function. The resulting data frame contains the predictor variables for each pixel overlayed by the polygons. This data frame then still needs to be merged with the information on the land cover class from the sf object.

extr <- extract(sen_ms, trainSites, na.rm=FALSE)
trainSites$PolygonID <- 1:nrow(trainSites)
extr <- merge(extr, trainSites, by.x="ID", by.y="PolygonID")
head(extr)
##   ID   B02 B03 B04    B08      B06      B07      B11      NDVI    lon     lat
## 1  1 921.5 877 469 4936.0 3456.156 4704.219 1699.156 0.8264570 402775 5759315
## 2  1 925.0 880 486 4918.0 3482.719 4717.906 1696.969 0.8201332 402785 5759315
## 3  1 918.0 878 491 4907.0 3405.031 4582.531 1722.375 0.8180808 402795 5759315
## 4  1 921.0 870 473 4924.0 3570.250 4891.750 1668.125 0.8247174 402765 5759305
## 5  1 908.0 872 454 4899.5 3584.062 4919.188 1657.312 0.8303913 402775 5759305
## 6  1 896.0 856 466 4895.5 3579.688 4897.812 1649.438 0.8261680 402785 5759305
##   ClassID         Label                           geom
## 1      11 Planted field POLYGON ((402789.2 5759326,...
## 2      11 Planted field POLYGON ((402789.2 5759326,...
## 3      11 Planted field POLYGON ((402789.2 5759326,...
## 4      11 Planted field POLYGON ((402789.2 5759326,...
## 5      11 Planted field POLYGON ((402789.2 5759326,...
## 6      11 Planted field POLYGON ((402789.2 5759326,...

In order to speed things up, for this tutorial we will reduce the data. Therefore, from each training polygon only 5% of the pixels will be used for model training. Therefore, from each polygon 5% of the pixels are randomly drawn.

set.seed(100)
trainids <- createDataPartition(extr$ID,list=FALSE,p=0.05)
trainDat <- extr[trainids,]

Basic model training

Predictors and response

For model training we need to define the predictor and response variables. As predictors we can use basically all information from the satellite image as we might assume they could all be meaningful for the differentiation between the land cover classes. As response variable we use the “Label” column of the data frame.

predictors <- c("B02","B03","B04","B08","B06","B07","B11","NDVI")
response <- "Label"

Model training

We then train a Random Forest model to lean how the classes can be distinguished based on the predictors (note: other algorithms would work as well. See https://topepo.github.io/caret/available-models.html for a list of algorithms available in caret). Caret’s train function is doing this job.

# train the model
set.seed(100)
model <- train(trainDat[,predictors],
               trainDat[,response],
               method="rf",
               importance=TRUE)
print(model)
## Random Forest 
## 
## 571 samples
##   8 predictor
##   7 classes: 'Fallow field', 'Grassland', 'Industrial', 'Inland water', 'Mixed forest', 'Planted field', 'Urban' 
## 
## No pre-processing
## Resampling: Bootstrapped (25 reps) 
## Summary of sample sizes: 571, 571, 571, 571, 571, 571, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9752218  0.9693965
##   5     0.9695842  0.9624249
##   8     0.9671307  0.9594081
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.
plot(varImp(model))

Model prediction

To perform the classification we can use the trained model and apply it to each pixel of the raster stack using the predict function. Then we can then create a map with meaningful colors of the predicted land cover using the tmap package.

prediction <- predict(sen_ms,model)
cols <- c("sandybrown", "green", "darkred", "blue", "forestgreen", "lightgreen", "red")

tm_shape(prediction) +
  tm_raster(palette = cols,title = "LUC")+
  tm_scale_bar(bg.color="white",bg.alpha=0.75)+
  tm_layout(legend.bg.color = "white",
            legend.bg.alpha = 0.75)

Tuning, validation, variable selection

The code shown above was a basic example of how to train a model and do a classification. But there is more to it. We have to deal with spatial autocorrelation of the data for model tuning and validation and possibly also during a varible selection.

Train Control

Before starting model training we can specify some control settings using trainControl. For hyperparameter tuning (mtry) as well as for error assessment we use a spatial 3-fold cross-validation. Therefore the training data are split into 3 folds but data from the same polygon are always grouped so that they never occur in both, training and testing. Also we make sure that each fold contains data from each land cover class. CAST’s CreateSpacetimeFolds is doing this job when we specify the polygon ID and the class label.

indices <- CreateSpacetimeFolds(trainDat,spacevar = "ID",k=3,class="Label")
ctrl <- trainControl(method="cv", 
                     index = indices$index,
                     savePredictions = TRUE)

We can visualize the “representativness” of the CV folds with plot_geodist:

traindat_sf <- st_as_sf(trainDat[,-(ncol(trainDat))],coords = c("lon","lat"),crs=crs(sen_ms))
geodist <- plot_geodist(traindat_sf,sen_ms,
              cvfolds = indices$indexOut,
             cvtrain = indices$index)

See how it compares to a random CV:

randomfolds <- createFolds(trainDat$ClassID, k = 3, list = TRUE, returnTrain = FALSE)

geodist <- plot_geodist(traindat_sf,sen_ms,
              cvfolds = randomfolds)

Spatial variable selection

Model training is performed using caret’s train function. However if we want to test, which predictor variables are relevant for making predictions on new spatial locations, we can use the forward feature selection (ffs) function from the CAST package, which is a wrapper around the train function. We specify “rf” as method, indicating that a Random Forest is applied. For model training we reduce the number of trees (ntree) to 75 to speed things up. Note that usually a larger number (>250) is appropriate. We use the Kappa index for validation.

model <- ffs(trainDat[,predictors],
               trainDat[,response],
               method="rf",
               metric="Kappa",
               trControl=ctrl,
               importance=TRUE,
               ntree=75)
plot_ffs(model)

plot_ffs(model,plotType = "selected")

print(model)
## Random Forest 
## 
## 571 samples
##   5 predictor
##   7 classes: 'Fallow field', 'Grassland', 'Industrial', 'Inland water', 'Mixed forest', 'Planted field', 'Urban' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 404, 376, 362 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   2     0.9678546  0.9599684
##   3     0.9645503  0.9557367
##   5     0.9504638  0.9387661
## 
## Kappa was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 2.

Prediction using the tuned model

prediction <- predict(sen_ms,model)
cols <- c("sandybrown", "green", "darkred", "blue", "forestgreen", "lightgreen", "red")

tm_shape(prediction) +
  tm_raster(palette = cols,title = "LUC")+
  tm_scale_bar(bg.color="white",bg.alpha=0.75)+
  tm_layout(legend.bg.color = "white",
            legend.bg.alpha = 0.75)

Model validation

When we print the model (see above) we get a summary of the prediction performance as the average Kappa and Accuracy of the three spatial folds. Looking at all cross-validated predictions together we can get the “global” model performance.

# get all cross-validated predictions:
cvPredictions <- model$pred[model$pred$mtry==model$bestTune$mtry,]
# calculate cross table:
table(cvPredictions$pred,cvPredictions$obs)
##                
##                 Fallow field Grassland Industrial Inland water Mixed forest
##   Fallow field            70         0          0            0            0
##   Grassland                0        21          0            0            0
##   Industrial               0         0         32            0            0
##   Inland water             0         0          0           74            0
##   Mixed forest             0         0          0            0          126
##   Planted field            0         0          0            0            0
##   Urban                    1         0          4            2            0
##                
##                 Planted field Urban
##   Fallow field              0     1
##   Grassland                 1     0
##   Industrial                0     3
##   Inland water              0     1
##   Mixed forest              1     4
##   Planted field            70     0
##   Urban                     0   160

Area of Applicability

We have seen that technically, the trained model can be applied to the entire area of interest (and beyond…as long as the sentinel predictors are available which they are, even globally). But we should assess if we SHOULD apply our model to the entire area. The model should only be applied to locations that feature predictor properties that are comparable to those of the training data. If dissimilarity to the training data is larger than the disimmilarity within the training data, the model should not be applied to this location.

The calculation of the AOA is quite time consuming. To make a bit faster we use a parallelization.

AOA <- aoa(sen_ms,model)
## No trainDI provided. Computing DI of training data...
## Computing DI of newdata...
## Computing AOA...
plot(AOA)

plot(AOA$DI)

plot(AOA$AOA)

The result of the aoa function has two layers: the dissimilarity index (DI) and the area of applicability (AOA). The DI can take values from 0 to Inf, where 0 means that a location has predictor properties that are identical to properties observed in the training data. With increasing values the dissimilarity increases. The AOA has only two values: 0 and 1. 0 means that a location is outside the area of applicability, 1 means that the model is inside the area of applicability. Find more information on how the AOA is derived in Meyer&Pebesma (2020).

The figure above shows the predictions ONLY for the AOA (Locations outside the AOA are shown in grey). We see that the model can be applied to most parts of Münster, however there are some locations (especially in the south-west) that are too different in their predictor properties so that we should exclude those predictions from our prediction result.

Summary

  • This tutorial has shown how to perform a remote sensing based land cover classification in R.
  • We identified the area of applicability (AOA) of the trained model to make sure that we don’t make predictions for locations that model has no knowledge about.
  • We transfered the model to a new area and concluded that a transfer is only possible when the model has knowledge about the new environment. Again, the AOA method was applied to identify the unknown locations.

Get further help

For further help on handling of raster and vector data in R see e.g. https://geocompr.github.io/. More information on the relevance of spatial validation strategies as well on the AOA can be found in recordings from OpenGeoHub (https://www.uni-muenster.de/RemoteSensing/lehre/summer_schools/) or in the Tutorials of the CAST package: https://hannameyer.github.io/CAST/